Add Anima modular pipeline #13732
Conversation
I'm the creator of Anima. I've run this and confirmed it looks good in terms of correct model implementation and producing images that match the quality of other implementations. I'll let the Diffusers team comment on any Diffusers-specific questions or concerns.
@rmatif thanks a lot for this, can you please post some example code and a couple of generations with this?
@claude can you do a review? |
Claude finished @sayakpaul's task in 4m 49s. Review of PR #13732: Add Anima pipeline
Overall this is a clean PR that reuses existing infra well (Cosmos transformer, QwenImage VAE, FlowMatchEuler scheduler). The code is readable and the LoRA conversion logic is well-structured. A few blocking issues to fix before merge.
thanks for the PR! super excited to have Anima in diffusers. On a high level I have two feedbacks/questions: (1) Can we explore having a dedicated Anima transformer model? (2) Can we support this pipeline from Modular Diffusers directly? Given the very active community ecosystem and the continuous training/release style, Modular is a better fit — see the docs here: https://huggingface.co/docs/diffusers/main/en/modular_diffusers/overview. Since you've already implemented the standard pipeline, it would take a little refactor. Happy to provide more info if interested; we have pretty good docs for AI agents on this and I can point you to reference PRs as well.
Regarding (1), what if we just subclassed the Cosmos DiT, like ComfyUI does? The main reason to try to avoid duplicating code is that Anima's DiT architecture is identical to the Cosmos-Predict2 DiT. The only change is the LLM Adapter module (called AnimaTextConditioner in this PR). In ComfyUI the adapter lives as a submodule of the DiT for convenience, but it's not called in the forward() method since it only needs to run once for the entire diffusion process. So regardless of the structure, the pipeline code is going to be calling the adapter "manually" only once.
This isn't something we do in diffusers — all our models are self-contained and inherit from `ModelMixin`.
ohh, we usually include text condition layers in forward as well for simplicity — the performance tradeoff is typically non-significant. But if that's not the case for Anima, keeping it as a separate component like this PR does would make sense.
@yiyixuxu My preference is also to keep the text conditioner as a separate component. The main reason is that Anima's DiT is not a new architecture: the denoiser weights and forward path are the Cosmos Predict2 DiT, and the Anima-specific part is the LLM adapter that turns Qwen3 hidden states + T5 token ids into the text embeddings the DiT consumes. Since subclassing would duplicate a model that is otherwise unchanged, reusing the Cosmos transformer directly seemed cleaner. The checkpoint conversion does split the adapter weights from the DiT weights, so they load as independent components. If the preference is still to make an Anima-specific transformer class, I can refactor. And agree that Modular Diffusers is a good fit for Anima. Would it be okay to handle Modular support in a follow-up PR?
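The weight split mentioned above can be illustrated as a simple prefix partition of the combined checkpoint. This is a minimal sketch, not the actual conversion script: the `llm_adapter.` prefix and key names here are placeholders, and the real converted repo layout may differ.

```python
def split_checkpoint(state_dict, adapter_prefix="llm_adapter."):
    """Partition a combined checkpoint into adapter vs. DiT weights by key prefix.

    The prefix is illustrative; the real Anima conversion may use different names.
    """
    adapter, dit = {}, {}
    for key, value in state_dict.items():
        if key.startswith(adapter_prefix):
            # Strip the prefix so the adapter loads as a standalone component.
            adapter[key[len(adapter_prefix):]] = value
        else:
            dit[key] = value
    return adapter, dit


# Tiny stand-in state dict (values would normally be tensors).
combined = {
    "llm_adapter.layers.0.attn.weight": 1,
    "blocks.0.attn.weight": 2,
}
adapter_sd, dit_sd = split_checkpoint(combined)
print(sorted(adapter_sd), sorted(dit_sd))
# ['layers.0.attn.weight'] ['blocks.0.attn.weight']
```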
It's not the case for Anima. The LLM Adapter is 6 transformer layers with both self- and cross-attention, which is heavier than what is typical in most models (often just a single MLP projection layer). Anima basically has a mini text encoder that is converting from Qwen3 embedding space to T5XXL embedding space for input to the model. It's been a while since I ran the numbers, and I didn't write it down, but I recall the LLM Adapter as being ~10% of the full forward pass. IMO this is enough to warrant being called just once for the entire diffusion loop (and is what ComfyUI does as well). |
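The once-per-loop call pattern described here can be sketched with stand-in components (call counters instead of real modules; names and the step count are illustrative only):

```python
class Counter:
    """Stand-in for a module; records how many times it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return args


conditioner = Counter()  # stand-in for the LLM adapter (AnimaTextConditioner)
denoiser = Counter()     # stand-in for the Cosmos DiT

# The adapter output depends only on the prompt, so it is computed once...
cond = conditioner("qwen3_hidden_states", "t5_token_ids")

# ...while the denoiser runs at every step of the diffusion loop.
num_inference_steps = 25
latents = "noise"
for _ in range(num_inference_steps):
    latents = denoiser(latents, cond)

print(conditioner.calls, denoiser.calls)  # 1 25
```

If the adapter is roughly 10% of a full forward pass, hoisting it out of the loop saves that cost on every step after the first.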
sounds good to keep it as a separate component. Can we support Anima only through Modular Diffusers, rather than maintaining both? We've been supporting new pipelines through both, but now that Modular is officially released we're looking to shift new pipelines to modular-only. Especially since we expect Anima to be a very actively developed model, both from the author and the community, the maintenance cost from the standard pipeline could be quite high for us.
Fair enough, I moved everything into Modular. Looking forward to your review. Here's the updated example:

```python
import torch
from diffusers import AnimaAutoBlocks
from diffusers.guiders import ClassifierFreeGuidance

pipe = AnimaAutoBlocks().init_pipeline("mrfatso/anima-preview3-diffusers")
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.update_components(guider=ClassifierFreeGuidance(guidance_scale=4.0))
pipe.to("cuda")

prompt = (
    "masterpiece, best quality, very aesthetic, absurdres, 1girl, solo, silver hair, blue eyes, "
    "long hair, school uniform, sailor collar, cherry blossoms, petals, spring, soft lighting, "
    "looking at viewer, upper body, detailed background"
)
negative_prompt = (
    "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, "
    "sepia, signature, artist name"
)
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=25,
    generator=torch.Generator(device="cuda").manual_seed(12341),
).images[0]
image.save("anima.png")
```
yiyixuxu left a comment
I left one comment; overall looks good to me, thanks for working on this.
`class AnimaTextConditionerBlock(nn.Module):`
ohhh but it is not a transformer. I think we have a couple of options:
1. Create a new folder under models/ for non-standard pipeline components.
2. Follow the same convention as in standard pipelines and host it under modular_pipelines/anima/text_conditioner.py. This requires a small change in modular `from_pretrained()` to work, since the model is pipeline-local and won't be importable at the top level.
want to hear everyone's thoughts!
I think maybe it's time for (1), because it is just strange that we host model components under pipeline folders. The pipeline-local model structure was designed at a time when we used the same UNet and VAE for every pipeline. A lot has changed since — all our models now follow the single-file pattern and pretty much every model is pipeline-specific. Maybe we don't have to keep that distinction anymore.
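For context on option (2): loading a pipeline-local module that isn't importable at the top level can be done with stdlib `importlib`. This is only a sketch of the general mechanism, not diffusers internals; the helper name and folder layout here are hypothetical.

```python
import importlib.util
import pathlib
import tempfile


def load_local_module(folder, name):
    """Load `<folder>/<name>.py` as a module without it being importable top-level.

    Hypothetical helper illustrating what a modular from_pretrained() would need
    to do for a pipeline-local component; not the actual diffusers code path.
    """
    path = pathlib.Path(folder) / f"{name}.py"
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo: write a stand-in text_conditioner.py into a temp folder and load it.
tmp = tempfile.mkdtemp()
(pathlib.Path(tmp) / "text_conditioner.py").write_text(
    "class AnimaTextConditioner:\n    name = 'anima'\n"
)
mod = load_local_module(tmp, "text_conditioner")
print(mod.AnimaTextConditioner.name)  # anima
```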


What does this PR do?
Adds modular-only support for Anima, a text-to-image model built on top of the Cosmos Predict2 DiT architecture.
This PR adds:
- `AnimaModularPipeline` and `AnimaAutoBlocks`
- `AnimaTextConditioner`
- Converted weights: https://huggingface.co/mrfatso/anima-preview3-diffusers
Fixes #13067
cc @tdrussell
Testing
Tested the converted checkpoint locally with txt2img generation and LoRA loading
Before submitting
Who can review?
@yiyixuxu @asomoza